EasyVisa Project

Context:

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally and abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

Objective:

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a Machine Learning based solution that can help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired your firm EasyVisa for data-driven solutions. You as a data scientist have to analyze the data provided and, with the help of a classification model:

Data Description

The data contains the different attributes of the employee and the employer. The detailed data dictionary is given below.

Importing necessary libraries and data

Data Overview

First Look

Exploratory Data Analysis (EDA)

There are 33 values that are negative. They range from -26 to -11. We will take the absolute value of the column.
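A minimal sketch of this fix follows. The text above does not name the affected column, so `no_of_employees` is an assumption here, and the small frame is made-up data standing in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the project data; the column name is an assumption.
df = pd.DataFrame({"no_of_employees": [120, -26, 4500, -11, 87]})

# Count the negative entries and inspect their range before changing anything.
negatives = df.loc[df["no_of_employees"] < 0, "no_of_employees"]
print(len(negatives), negatives.min(), negatives.max())

# Treat the sign as a data-entry error and keep the magnitude.
df["no_of_employees"] = df["no_of_employees"].abs()
```

Taking the absolute value keeps all rows; dropping the negative rows would be the alternative if the magnitudes themselves looked unreliable.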

Initial EDA Observations:

We will keep the outliers and use Decision Trees in our modeling, since Decision Trees are not sensitive to outliers.

Let's Visualize the Data Further

Leading Questions:

  1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

  2. How does the visa status vary across different continents?

  3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

  4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

  5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

1. Those with higher education may want to travel abroad for a well-paid job. Does education play a role in Visa certification?

Observation:

Yes! The higher the education, the higher the percentage of Visa certifications.

2. How does the visa status vary across different continents?

Observation:

3. Experienced professionals might look abroad for opportunities to improve their lifestyles and career development. Does work experience influence visa status?

Observation:

4. In the United States, employees are paid at different intervals. Which pay unit is most likely to be certified for a visa?

Observation:

5. The US government has established a prevailing wage to protect local talent and foreign workers. How does the visa status change with the prevailing wage?

Observation:

Observation

The rate of certification does not seem to be influenced by whether or not an employee requires job training.

Observation

The rate of certification is also the same whether or not the position is full time.

Observation

There seems to be a higher rate of certification for employees going to the Midwest region.

Observation

Observation

The prevailing wage average is a little higher in the Midwest and Island regions.

Observation

The average prevailing wages for the Year, Week, and Month units are nearly the same. Only the Hour unit looks different.

Observation:

The prevailing wage for Hour ranges from 0 to 1000. The other units have far greater ranges.

Data Preprocessing

Summary of Data Processing

In this first level of processing, we one-hot encoded the object columns and converted education_of_employee to ordinal numbers.
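The two encodings can be sketched as follows. The education labels, their ordering, and the sample rows are assumptions made for illustration; only the column names education_of_employee and no_of_employees come from this notebook.

```python
import pandas as pd

# Hypothetical sample; category labels and the continent column are assumptions.
df = pd.DataFrame({
    "education_of_employee": ["High School", "Master's", "Bachelor's", "Doctorate"],
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "no_of_employees": [120, 26, 4500, 11],
})

# Ordinal encoding: map education levels to increasing integers.
education_order = {"High School": 1, "Bachelor's": 2, "Master's": 3, "Doctorate": 4}
df["education_of_employee"] = df["education_of_employee"].map(education_order)

# One-hot encode the remaining categorical columns, dropping the first level of each.
categorical_cols = ["continent"]
encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print(encoded.columns.tolist())
```

An ordinal code suits education because the levels have a natural order, while one-hot encoding avoids imposing an order on unordered categories.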

EDA

Observations of Second EDA

Top correlations for case_status:

Bottom correlations for case_status:

If we kept all the dummy variables instead of dropping the first column, we could see more correlations here.
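To illustrate the point, here is a small sketch comparing the correlations with and without the first dummy column. The data is made up, and it assumes case_status has already been encoded as 1 for certified and 0 for denied.

```python
import pandas as pd

# Hypothetical sample; case_status is assumed to be 1 = certified, 0 = denied.
df = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa", "Europe", "Africa"],
    "case_status": [1, 1, 0, 0, 1, 0],
})

# Keep every dummy column versus dropping the first level.
full = pd.get_dummies(df, columns=["continent"], dtype=int)
reduced = pd.get_dummies(df, columns=["continent"], drop_first=True, dtype=int)

# Correlation of each dummy column with the target, sorted from highest to lowest.
corr_full = full.corr()["case_status"].drop("case_status").sort_values(ascending=False)
corr_reduced = reduced.corr()["case_status"].drop("case_status").sort_values(ascending=False)
print(corr_full)
print(corr_reduced)
```

The full version is useful for inspection only; the dropped column is redundant for modeling because it is implied by the others.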

Split the data and write performance function
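A minimal sketch of this step, using synthetic data from `make_classification` in place of the visa dataset. The function name `model_performance`, the 70/30 split, and the choice of metrics are assumptions, not necessarily what this notebook used.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def model_performance(model, X, y):
    """Return accuracy, recall, precision, and F1 for a fitted classifier as a one-row frame."""
    pred = model.predict(X)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y, pred)],
        "Recall": [recall_score(y, pred)],
        "Precision": [precision_score(y, pred)],
        "F1": [f1_score(y, pred)],
    })


# Synthetic stand-in for the encoded features and the case_status target.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
train_scores = model_performance(tree, X_train, y_train)
test_scores = model_performance(tree, X_test, y_test)
print(pd.concat([train_scores, test_scores], keys=["train", "test"]))
```

Reporting the same metrics on both splits makes a train/test gap, and therefore overfitting, easy to spot.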

Building bagging and boosting models

Decision Tree Classifier

Observation on first Decision Tree model

This first Decision Tree model using defaults is definitely overfitting the training set. An accuracy of 65% is not very good. We can do better than an F1 Score of 74%.

Top 5 Features:

The top 4 features have numerical ranges whereas the rest of the features are binary (0 or 1). Is this surprising? Not really. Dummy variables alone have little weight, but can be more deterministic when combined with other dummy variables. It makes sense that features with larger ranges could be divided more times in decision trees and so interact with other variables more often.
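This effect can be reproduced on pure noise. In the sketch below, neither feature carries any signal, yet an unconstrained tree tends to assign most of its impurity-based importance to the continuous column simply because it offers many more split points. The data and names are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(size=n),          # continuous feature, pure noise
    rng.integers(0, 2, size=n),  # binary feature, pure noise
])
y = rng.integers(0, 2, size=n)   # labels unrelated to either feature

# A default tree grows until every leaf is pure, so it must keep splitting.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
continuous_importance, binary_importance = tree.feature_importances_
print(continuous_importance, binary_importance)
```

This is one reason to read impurity-based importances with some caution when numerical and dummy columns are mixed.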

A higher prevailing wage might increase the likelihood of certification overall, but an application with a lower prevailing wage might be just as likely to be approved if all the other boxes are checked off.

Bottom 5 Features:

It's quite possible that continent is a biased variable. It might be true that more certifications are given to one continent over another, but perhaps we don't want to predict Visas in the future based on continent.

Let's make 2 more Decision Trees using the 2 other datasets, which are engineered slightly differently.

Initial Decision Tree Observations using all 3 training sets:

Decision Tree Classifier with GridSearchCV

Observation of GridSearchCV on Decision Tree

The model was tuned to find the best F1 Score, which it did, but unfortunately the resulting tree is far too simple, with a max depth of 2. The training and testing scores are the same, and there are no important features to analyze. This model is not stable.
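A hedged sketch of this kind of search, on synthetic data. The parameter grid below is illustrative and is not the grid used in this notebook; the last lines show one way to check how deep the selected tree is and how many features it actually uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the visa data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Illustrative grid; the values are assumptions.
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}

# Select the combination with the best cross-validated F1 score.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1), param_grid, scoring="f1", cv=5
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_
features_used = int((best_tree.feature_importances_ > 0).sum())
print(search.best_params_, best_tree.get_depth(), features_used)
```

If the search keeps returning a very shallow tree, inspecting `search.cv_results_` shows how close the deeper candidates came.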

Bagging Classifier

Bagging Classifier with GridSearchCV

Observation of Bagging

Both Bagging models were overfitting, but the F1 scores on the test sets are improving.

Random Forest Classifier

Observation of Default Random Forest

The model is overfitting.

Random Forest Classifier with GridSearchCV

Finally, the Random Forest model using GridSearchCV has reduced overfitting.

Let's see if it has also reduced feature bias.

Observation

We see that the original numerical columns such as no_of_employees and yr_of_estab have moved down and lost importance, while education_of_employee has moved to the top!

This suggests that our tuned Random Forest is aggregating predictions over random subsets of features more effectively; the importance is spreading out across more features.

Let's build Random Forests using GridSearch on the 2nd and 3rd training sets now to see how the feature importances change.

Observation of Tuned Random Forest models on the 3 slightly altered datasets

Can we find a more stable model with an F1 Score higher than 79%?

AdaBoost Classifier

AdaBoost Classifier with GridSearchCV

Observation of AdaBoost (Default and Tuned)

Gradient Boosting Classifier

Observation on Gradient Boosting on 3 different training sets

Can we do better than 81.8% F1 Score?

Gradient Boosting Classifier with GridSearchCV

Observation of Tuned Gradient Boost Model

Top 5 Important Features

XGBoost Classifier

XGBoost Classifier with GridSearchCV

Observation of XGBoost

Stacking Classifier

Observation of Stacking

Model Performance Comparison and Conclusions

Observations

This final graph shows the performance of all the models in the order they were built in this notebook.

How to facilitate the process of visa approvals...

From this analysis we have learned that certain applications are more likely to be approved.

An application for a certified case looks like this:

12.7% of the dataset meets the ideal profile of certified applications and 99% of those observations were actually approved.

If the department receives 800,000 applications next year, our model and our "ideal profile" will help us process roughly 100,000 of them (12.7% of 800,000 is about 101,600) for approval at once.
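The arithmetic behind this estimate can be checked directly. This sketch takes the 12.7% share, the 99% approval rate, and the 800,000 volume from the text at face value; the variable names are illustrative.

```python
total_applications = 800_000  # hypothetical volume for next year, from the text
profile_share = 0.127         # share of the dataset matching the ideal profile
approval_rate = 0.99          # share of matching cases that were actually approved

# Applications expected to match the ideal profile.
matching = round(total_applications * profile_share)

# Of those, the number we would expect to be certified.
expected_certified = round(matching * approval_rate)

print(matching, expected_certified)  # 101600 100584
```

With these inputs, about 101,600 applications match the profile and about 100,584 of them would be expected to be certified, so any headline figure here is best read as a rounded estimate.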

Then the model can be used to organize the applications into bins for faster processing.
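One way to do the binning is to group applications by the model's predicted probability of certification. The sketch below assumes that approach; the probabilities, thresholds, and bin labels are all hypothetical.

```python
import pandas as pd

# Hypothetical predicted probabilities of certification for five applications.
proba = pd.Series([0.95, 0.40, 0.72, 0.10, 0.88], name="p_certified")

# Assumed thresholds: at or below 0.3, between 0.3 and 0.7, above 0.7.
bins = pd.cut(
    proba,
    bins=[0.0, 0.3, 0.7, 1.0],
    labels=["likely denied", "manual review", "likely certified"],
    include_lowest=True,
)
print(bins.value_counts())
```

In practice the probabilities would come from the chosen model's `predict_proba` output, and the thresholds would be set with the department's tolerance for wrong fast-track decisions in mind.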

Actionable Insights and Recommendations